Regression with a Categorical Independent Variable
Regression
GLM
Incorporating categorical predictors (i.e., factor variables) into a regression model using dummy variables.
General Principles
To study the relationship between a categorical independent variable and a continuous dependent variable, we use a Categorical model which applies stratification.
Stratification involves modeling how the k different categories of the independent variable affect the target continuous variable by performing a regression for each k category and assigning a regression coefficient for each category. To implement stratification, categorical variables are often encoded using one-hot encoding 🛈 or by converting categories to indices 🛈.
As we generate regression coefficients for each k category, we need to specify a prior with a shape equal to the number of categories k in the code (see comments in the code).
To compare differences between categories, we need to compute the distribution of the differences between categories, known as the contrast distribution. Never compare confidence intervals or p-values directly.
Example
Below is an example of code that demonstrates Bayesian regression with an independent categorical variable using the BayesForge (BF) package. The data consist of one continuous dependent variable (kcal_per_g), representing the caloric value of milk per gram, a categorical independent variable (index_clade), representing species clade membership, and a continuous independent variable (mass), representing the mass of individuals in the clade. The goal is to estimate the differences in milk calories between clades. This example is based on McElreath (2018).
from BayesForge import bf# Setup device------------------------------------------------m = bf(platform='cpu')# Import Data & Data Manipulation ------------------------------------------------# Importfrom importlib.resources import filesdata_path = m.load.milk(only_path =True)m.data(data_path, sep=';') m.index(["clade"]) # Convert clade names into indexm.scale(['kcal_per_g']) # Scale# Define model ------------------------------------------------def model(kcal_per_g, index_clade, mass): a = m.dist.normal(0, 0.5, shape=(4,), name ='a') # shape based on the number of clades b = m.dist.normal(0, 0.5, shape=(4,), name ='b') s = m.dist.exponential( 1, name ='s') mu = a[index_clade]+b[index_clade]*mass m.dist.normal(mu, s, obs=kcal_per_g)# Run mcmc ------------------------------------------------m.fit(model, progress_bar=False) # Optimize model parameters through MCMC sampling# Summary ------------------------------------------------m.summary()
/home/sosa/work/3.12venv/lib/python3.10/site-packages/tqdm/auto.py:21: TqdmWarning:
IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html
bf v 0.0.48 package loaded
jax.local_device_count 32
mean
sd
hdi_5.5%
hdi_94.5%
mcse_mean
mcse_sd
ess_bulk
ess_tail
r_hat
a[0]
-0.34
0.33
-0.83
0.22
0.01
0.00
2866.52
2920.12
1.0
a[1]
0.60
0.29
0.15
1.08
0.01
0.00
3235.89
2685.49
1.0
a[2]
0.31
0.39
-0.29
0.94
0.01
0.01
2643.88
2599.97
1.0
a[3]
-0.17
0.44
-0.89
0.53
0.01
0.01
2852.61
2653.21
1.0
b[0]
-0.00
0.01
-0.02
0.01
0.00
0.00
3096.36
2876.63
1.0
b[1]
-0.18
0.12
-0.36
0.01
0.00
0.00
3178.46
2944.88
1.0
b[2]
0.08
0.07
-0.02
0.19
0.00
0.00
2677.26
2456.24
1.0
b[3]
-0.28
0.25
-0.68
0.11
0.00
0.00
2855.26
2808.28
1.0
s
0.77
0.12
0.59
0.96
0.00
0.00
2859.19
2708.14
1.0
library(BayesForge)m=importBF(platform='cpu')# Load csv filem$data(m$load$milk(only_path = T), sep=';')m$scale(list('kcal.per.g')) # Manipulatem$index(list('clade')) # Scalem$data_to_model(list('kcal_per_g', 'index_clade')) # Send to model (convert to jax array)# Define model ------------------------------------------------model <-function(kcal_per_g, index_clade){# Parameter prior distributions beta =bf.dist.normal( 0, 0.5, name ='beta', shape =c(4)) # shape based on the number of clades sigma =bf.dist.exponential(1, name ='s')# Likelihoodbf.dist.normal(beta[index_clade], sigma, obs=kcal_per_g)}# Run mcmc ------------------------------------------------m$fit(model) # Optimize model parameters through MCMC sampling# Summary ------------------------------------------------m$summary() # Get posterior distributions
usingBayesForge# Setup device------------------------------------------------m =importBF(platform="cpu")# Import Data & Data Manipulation ------------------------------------------------# Importdata_path = m.load.milk(only_path =true)m.data(data_path, sep=';')m.index("clade") # Convert clade names into indexm.scale(["kcal_per_g"]) # Scale# Define model ------------------------------------------------@BFfunctionmodel(kcal_per_g, index_clade, mass) a = m.dist.normal(0, 0.5, shape=(4,), name ="a") # shape based on the number of clades b = m.dist.normal(0, 0.5, shape=(4,), name ="b") s = m.dist.exponential( 1, name ='s') mu = a[index_clade]+b[index_clade]*mass m.dist.normal(mu, s, obs=kcal_per_g)end# Run mcmc ------------------------------------------------m.fit(model) # Optimize model parameters through MCMC sampling# Summary ------------------------------------------------m.summary() # Get posterior distributions
Caution
For R users, when working with indices you have to ensure 1) that indices are intergers (i.e. as.integer(index_clade)) and, 2) that indices start at 0 (i.e. as.integer(index_clade)-1).
Mathematical Details
Frequentist formulation
We model the relationship between the categorical input feature (X) and the target variable (Y) using the following equation:
Y_i = \alpha + \beta_k X_i + \sigma
Where:
Y_i is the dependent variable for observation i.
\alpha is the intercept term.
\beta_k are the regression coefficients for each k category.
X_i is the encoded categorical input variable for observation i.
\sigma is the error term.
We can interpret \beta_i as the effect of each category on Y relative to the baseline (usually one of the categories or the intercept).
Bayesian formulation
In the Bayesian formulation, we define each parameter with priors 🛈. We can express the Bayesian regression model accounting for prior distributions as follows:
Y \sim \text{Normal}(\alpha + \beta_K X, \sigma)
\alpha \sim \text{Normal}(0,1)
\beta_K \sim \text{Normal}(0,1)
\sigma \sim \text{Exponential}(1)
Where:
Y_i is the dependent variable for observation i.
\alpha is the intercept term, which in this case has a unit-normal prior.
\beta_K are slope coefficients for the K distinct independent variables categories, which also have unit-normal priors.
X_i is the encoded categorical input variable for observation i.
\sigma is a standard deviation parameter, which here has a Exponential prior that constrains it to be positive.
Notes
Note
As for Multiple continuous variables, we can extend the linear regression model to include multiple categorical predictors.